Return-Path: <news@columbia.edu>
Received: from newsmaster.cc.columbia.edu (newsmaster.cc.columbia.edu [128.59.59.30])
by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA25773
for <kermit.misc@watsun.cc.columbia.edu>; Sat, 15 Jan 2000 16:25:30 -0500 (EST)
Received: (from news@localhost)
by newsmaster.cc.columbia.edu (8.8.5/8.8.5) id QAA22251
for kermit.misc@watsun.cc.columbia.edu; Sat, 15 Jan 2000 16:07:29 -0500 (EST)
X-Authentication-Warning: newsmaster.cc.columbia.edu: news set sender to <news> using -f
From: fdc@watsun.cc.columbia.edu (Frank da Cruz)
Subject: Case Study 8: Unicode
Date: 15 Jan 2000 21:07:28 GMT
Organization: Columbia University
Message-ID: <85qnig$ln8$1@newsmaster.cc.columbia.edu>
To: kermit.misc@columbia.edu
Who doesn't know what Unicode is? Now that computing has become so
widespread and Web-centric -- a revolution in itself -- we are on the brink
of another major revolution in computing, one that will have profound
effects on all of us and perhaps even on the future course of history.
Until now, most computer text has been recorded in single-byte 7-bit or
8-bit character sets (1), one per language or language group. For example,
the default character set of the Web is ISO 8859-1 Latin Alphabet 1, which
can encode English plus most West European languages: Italian, Spanish,
German, Icelandic, etc. But it can't encode East European languages like
Polish, Czech, or Hungarian, even though they use the same alphabet, because
the accents are different. Nor can it represent languages like Russian,
Arabic, Hebrew, or Japanese that use other writing systems. Therefore, to
write in languages other than our own we often have to switch character
sets, and as anybody who has tried it can tell you, that's a tricky
business. And it's even trickier if we need to mix different languages in
the same document; for example, Portuguese, Romanian, Russian, and Armenian.
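To make the Latin-1 limitation concrete, here is a minimal Python sketch. The sample words and the helper name fits_latin1 are my own illustrations, not anything from Kermit or the standards; the only real API used is Python's built-in "iso-8859-1" codec.

```python
# Illustrative sample words (assumptions, chosen for this sketch only).
samples = {
    "Spanish": "ma\u00f1ana",          # n with tilde, U+00F1
    "Icelandic": "\u00fe\u00fa",       # thorn + u acute, U+00FE U+00FA
    "Polish": "\u0142\u00f3d\u017a",   # l with stroke, U+0142, is outside Latin-1
    "Russian": "\u043c\u0438\u0440",   # Cyrillic letters
}

def fits_latin1(text):
    """Return True if every character of text exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

results = {lang: fits_latin1(word) for lang, word in samples.items()}
for lang, ok in results.items():
    print(lang, "fits in Latin-1" if ok else "does not fit in Latin-1")
```

The Spanish and Icelandic words encode without complaint; the Polish and Russian ones do not, which is the switching problem described above in miniature.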
The great promise of the Internet is to bring people in all countries
together as never before. We can get to know one another and appreciate each
other's languages and cultures with unprecedented convenience. And the
great lesson of mass computer and Internet culture so far is: for anything
to catch on, it has to be easy. Coping with the current Babel of character
sets is anything but easy: different platforms use different private
character sets (such as PC code pages), which must map to any of an array of
standard character sets (such as the ISO Latin alphabets) or to different
private character sets on other platforms. If languages are to be mixed,
elaborate and often product-specific switching mechanisms are required.
Unicode to the rescue. For more than 10 years, a consortium of corporate,
academic, and standards-body representatives has been working to create a
single universal character set capable of representing all the world's
writing systems. To find out all about Unicode, visit the Unicode
Consortium website:
http://www.unicode.org/
Unicode marks a fundamental change in how we compute. Each character is
represented not by a single byte (1), but by one, two, three, four, or
more bytes, depending on the specific Unicode Transformation Format (UTF)
used and the specific characters involved. But since we have fifty years of
software written for the one-byte-per-character model, the transition to
Unicode will be a long process. One, however, that is already well underway.
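As a rough illustration of how the byte count depends on both the UTF and the character, the following Python sketch measures a few characters under three encodings. The character choices and the helper name encoded_lengths are assumptions made for this example; the codecs are the ones built into current Python, which never produce more than four bytes per character.

```python
# Four sample characters (illustrative choices, not from the original text).
chars = {
    "LATIN CAPITAL A": "A",               # U+0041
    "E WITH ACUTE": "\u00e9",             # U+00E9
    "EURO SIGN": "\u20ac",                # U+20AC
    "MUSICAL G CLEF": "\U0001d11e",       # U+1D11E, outside the 16-bit range
}

def encoded_lengths(ch):
    """Return the number of bytes ch occupies in each encoding (no BOM)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }

table = {name: encoded_lengths(ch) for name, ch in chars.items()}
for name, sizes in table.items():
    print(name, sizes)
```

In UTF-8 the four characters take 1, 2, 3, and 4 bytes respectively; in UTF-16 the first three take 2 bytes and the last takes 4; in UTF-32 every one takes 4.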
A major part of this transition is the creation of Unicode fonts. The work
is being done piecemeal, with each font containing a (perhaps) different
subset of Unicode, with additional characters and writing systems added over
time. Your computer might already support Unicode to some extent. To check,
visit:
http://www.columbia.edu/kermit/utf8.html
This is a no-frills plain-text web page containing text in many languages (2)
encoded in Unicode Transformation Format 8 (UTF-8). You might see a lot of
"unknown glyph" boxes or gibberish, depending on your browser, font, and
locale.
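One common source of the gibberish is software that reads UTF-8 bytes as if they were a single-byte character set. The Python sketch below reproduces that effect; the sample word is an assumption for illustration, and this is only one of several ways a page can be misdisplayed (a missing glyph in the font is another, and is not shown here).

```python
# A word with one accented letter (illustrative sample).
text = "caf\u00e9"                            # "cafe" with e acute

utf8_bytes = text.encode("utf-8")             # e acute becomes two bytes: C3 A9

# What a Latin-1-only program would display for those same bytes.
misread = utf8_bytes.decode("iso-8859-1")

# What a UTF-8-aware program recovers.
correct = utf8_bytes.decode("utf-8")

print(len(text), "characters,", len(utf8_bytes), "bytes in UTF-8")
print("misread as Latin-1:", ascii(misread))
print("decoded as UTF-8:  ", ascii(correct))
```

The misread string has five characters instead of four: the single accented letter shows up as A-tilde followed by a copyright sign.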
Now visit:
http://www.hclrss.demon.co.uk/unicode/fonts.html
for a survey of Unicode fonts to see how you might be able to widen the
horizons of your own computer right now. Try installing an updated font
and visiting the UTF-8 Sample page again.
What you see marks a great leap forward: a vendor-neutral, application-
independent method for encoding text in many languages -- and some day, we
hope, all languages. Unlike other Web pages you might have seen, there
are no tricks here -- for example, no GIFs to represent Chinese or Hebrew.
It's just plain text. You can select and copy it like any other text, but
whether you can paste it into another application depends on the other
application. On Windows 95 and later, for example, you can paste it into
Word with a Unicode font such as Arial or Times New Roman selected, and see
several of the non-Roman scripts but not necessarily all of them.
The Kermit Project has been a member of the Unicode Consortium for years,
and now C-Kermit 7.0 supports Unicode as a transfer character-set, a file
character-set, and a terminal character-set. All of a sudden you have a
convenient cross-platform tool for migration to Unicode and interfacing
between Unicode and traditional environments. For example:
. You can make a connection from a traditional environment to a
Unicode platform (such as Plan 9) and have Kermit translate
between your local character-set and Unicode during the terminal
session. Or vice versa. (3)
. You can send traditionally encoded text (say, Italian encoded
in Latin-1 or Code Page 850) to a Unicode environment, and you
can import Unicode text to your traditional environment.
. You can convert local files from traditional character sets
to Unicode, and vice versa.
. You can convert between different Unicode Transformation Formats.
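To show what the last two kinds of conversion amount to, here is a minimal Python sketch. It is a stand-in for illustration only, not Kermit's implementation or command syntax: the function name convert_file, the file names, and the sample text are all hypothetical, and the codecs are Python's built-in ones.

```python
import os
import tempfile

def convert_file(src, dst, src_encoding, dst_encoding):
    """Re-encode a text file from src_encoding to dst_encoding."""
    # newline="" keeps line terminators exactly as they are in the source.
    with open(src, "r", encoding=src_encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding=dst_encoding, newline="") as fout:
        fout.write(text)

# Hypothetical file names in a scratch directory.
workdir = tempfile.mkdtemp()
latin1_path = os.path.join(workdir, "italian.latin1.txt")
utf8_path = os.path.join(workdir, "italian.utf8.txt")
utf16_path = os.path.join(workdir, "italian.utf16.txt")

# Create a small Latin-1 file containing Italian text with an accented letter.
with open(latin1_path, "wb") as f:
    f.write("Perch\u00e9 no?\n".encode("iso-8859-1"))

# Traditional character set to Unicode (UTF-8).
convert_file(latin1_path, utf8_path, "iso-8859-1", "utf-8")

# One Unicode Transformation Format to another (UTF-8 to UTF-16).
convert_file(utf8_path, utf16_path, "utf-8", "utf-16")

for path in (latin1_path, utf8_path, utf16_path):
    print(os.path.basename(path), os.path.getsize(path), "bytes")
```

The same function should handle a PC code page source by passing "cp850" as the source encoding. Going the other way, from Unicode to a traditional set, can fail for any character the target set lacks, so a real tool needs a policy for substitutions.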
C-Kermit's Unicode support is integrated with all its other character-set
support, which covers:
. English and West European (Latin-1) languages.
. East European Roman-Alphabet (Latin-2) languages.
. Russian, Ukrainian, and other languages written in Cyrillic.
. Greek.
. Hebrew.
. Japanese.
Others can, and no doubt will, be added in the future. All of this and more
will be included in the forthcoming releases of Kermit 95. Most of what you
see on the UTF-8 Sample Page, you will also be able to see on your Kermit 95
screen; it's "just" a matter of having the right font (4).
As usual, I've rambled on longer than planned and still only scratched the
surface. For greater detail, read Section 6.6 of the ckermit2.txt file.
Notes:
(1) Oversimplification. Traditional East Asian character sets, among
others, use various multibyte encodings.
(2) If you can add languages to this page, please let me know.
(3) To learn about Unicode support in Linux, visit
(4) A GUI window is required in Windows 95 and 98, but not in Windows NT
or 2000.
- Frank